ABSTRACT

In previous works, D. J. Woodruff derived expressions
for three different conditional test score variances: (1) the
conditional standard error of prediction (CSEP); (2) the conditional
standard error of measurement in prediction (CSEMP); and (3) the
conditional standard error of estimation (CSEE). He also presented
step-up formulas that require only weak assumptions and allow the
estimation of full-length test score conditional variances from two
parallel half-length tests. This study empirically investigates the
accuracy of the step-up formulas using real test data from 40,000
examinees with scores on the ACT Assessment and concludes that the
step-up formulas work fairly well for the CSEP and the CSEMP and less
well for the CSEE. The CSEMP is also compared with two other
procedures for estimating the conditional standard error of
measurement. Appendixes present derivations and the study figures.
(Contains two tables, eight appendix figures, and eight references.)
(Author/SLD)
An Empirical Investigation of the Accuracy of a Step-up Method
for Estimating Test Score Conditional Variances 



Imelda C. Go 



David J. Woodruff 



Abstract 



In previous works, Woodruff derived expressions for three different conditional test 
score variances: the conditional standard error of prediction (CSEP), the conditional 
standard error of measurement in prediction (CSEMP), and the conditional standard 
error of estimation (CSEE). He also presented step-up formulas that require only 
weak assumptions and that allow the estimation of full-length test score conditional 
variances from two parallel half-length tests. This study empirically investigates the 
accuracy of the step-up formulas using real test data and concludes that the step-up 
formulas work fairly well for the CSEP and the CSEMP but less well for the CSEE.
The CSEMP is also compared with two other procedures for estimating the
conditional standard error of measurement (CSEM).
An Empirical Evaluation of the Accuracy of a Step-up Method
for Estimating Test Score Conditional Variances 

The Standards for Educational and Psychological Testing (AERA, APA, &
NCME, 1985) list as a secondary standard (one that is desirable but often not
feasible) the recommendation that, conditional on critical score values, the
standard error of measurement (SEM) be computed and reported. Under the
classical test theory model, $X = T + E$, the conditional standard error of
measurement (CSEM) is defined as the conditional observed score (or error
score) variance for a fixed value of true score, that is,
$\sigma^2(X \mid T=t) = \sigma^2(E \mid T=t)$. In practice, true scores are
usually not known, so methods have been developed that estimate
$\sigma^2(E \mid X=x)$ in place of $\sigma^2(E \mid T=t)$. However, when the
conditioning is on X rather than T, it can be shown that
$\sigma^2(E \mid X) = \sigma^2(T \mid X) = -\sigma(T, E \mid X)$ for all
values of X. Hence, $\sigma^2(E \mid X)$ is artificially constrained in a way
that $\sigma^2(E \mid T)$ is not. Also, Woodruff (1990) shows that if the
reliability of X is less than one, then
$\mu[\sigma^2(E \mid X)] < \mu[\sigma^2(E \mid T)]$, where $\mu$ denotes
expectation. Hence, on average, $\sigma^2(E \mid X)$ underestimates
$\sigma^2(E \mid T)$. Such considerations led Woodruff (1990, 1991) to develop
an alternative method for estimating conditional test score variances. The
purpose of this paper is to empirically evaluate the accuracy of this
alternative procedure and to compare the alternative procedure with other
procedures.
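
The constraint that arises when conditioning on X follows directly from the model. A short derivation, using only $X = T + E$ and treating the observed score as fixed at $X = x$, is sketched below; it is offered as a reading aid and is not taken from Woodruff (1990).

```latex
\begin{align*}
E &= x - T \quad \text{given } X = x, \\
\sigma^2(E \mid X = x) &= \sigma^2(x - T \mid X = x) = \sigma^2(T \mid X = x), \\
\sigma(T, E \mid X = x) &= \sigma(T,\, x - T \mid X = x) = -\sigma^2(T \mid X = x).
\end{align*}
```

Because the error score is a deterministic function of the true score once X is fixed, the conditional error variance cannot vary independently of the conditional true score variance.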

The Procedures 

Consider two classically parallel full-length tests, $X1 = T_X + E_{X1}$ with
$m_{X1}$ items and $X2 = T_X + E_{X2}$ with $m_{X2}$ items, both of which are
administered to N examinees. It is shown in Appendix A that it is reasonable to
assume that $\sigma(T_X, E_{X2} \mid X1) = 0$ so that the following
decomposition holds:

$\sigma^2(X2 \mid X1) = \sigma^2(T_X \mid X1) + \sigma^2(E_{X2} \mid X1)$.   (1.)
Woodruff (1990, 1991) calls $\sigma^2(X2 \mid X1)$ the squared conditional
standard error of prediction (CSEP), $\sigma^2(T_X \mid X1)$ the squared
conditional standard error of estimation (CSEE), and
$\sigma^2(E_{X2} \mid X1)$ the squared conditional standard error of
measurement in prediction (CSEMP). All three of these conditional variances
offer information about the accuracy of test scores at specific locations on the
score scale, but it is the CSEMP that is most closely related to the CSEM. In
Appendix A, it is shown that the average value of the CSEMP equals the
average value of the CSEM, and this strongly supports the recommendation
that the CSEMP be used as a substitute for the CSEM. Another advantage of
using the CSEMP is that the CSEMP requires only the relatively weak
assumptions of classical test theory.

For each value of $X1 = 0, 1, 2, \ldots, m_{X1}$, let the item scores for X2 be
analyzed as a two-way persons (P) by measures (M) ANOVA with one observation
per cell. In these conditional ANOVAs, let $MS_P(X2 \mid X1)$ denote the main
effect mean square for persons and let $MS_{PM}(X2 \mid X1)$ denote the persons
by measures interaction mean square. Following Woodruff (1990, 1991), estimates
for the three conditional variances are given by:

$[CSEP(X2 \mid X1)]^2 = s^2(X2 \mid X1) = m_{X2}\, MS_P(X2 \mid X1)$,   (2.)

$[CSEMP(X2 \mid X1)]^2 = s^2(E_{X2} \mid X1) = m_{X2}\, MS_{PM}(X2 \mid X1)$, and   (3.)

$[CSEE(X2 \mid X1)]^2 = s^2(T_X \mid X1) = s^2(X2 \mid X1) - s^2(E_{X2} \mid X1)$
$= m_{X2}\,[MS_P(X2 \mid X1) - MS_{PM}(X2 \mid X1)]$.   (4.)
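
The three estimates can be illustrated with a short computation. The following is a minimal sketch, assuming the item scores of the examinees at one value (or interval) of X1 are held in a persons-by-measures array and that the usual two-way ANOVA degrees of freedom apply; the function name and the returned keys are hypothetical and are not part of Woodruff's presentation.

```python
import numpy as np

def conditional_variances(items):
    """Squared CSEP, CSEMP, and CSEE from X2 item scores.

    items: array of shape (persons, measures) holding the X2 item scores of
    the examinees who share one value (or one interval) of X1.
    """
    items = np.asarray(items, dtype=float)
    n, m = items.shape
    grand = items.mean()
    person_means = items.mean(axis=1, keepdims=True)
    measure_means = items.mean(axis=0, keepdims=True)

    # Persons main effect mean square, df = n - 1.
    ms_p = m * np.sum((person_means - grand) ** 2) / (n - 1)
    # Persons-by-measures interaction mean square, df = (n - 1)(m - 1).
    resid = items - person_means - measure_means + grand
    ms_pm = np.sum(resid ** 2) / ((n - 1) * (m - 1))

    csep_sq = m * ms_p            # equation (2.)
    csemp_sq = m * ms_pm          # equation (3.)
    csee_sq = csep_sq - csemp_sq  # equation (4.); can be negative in small samples
    return {"CSEP_sq": csep_sq, "CSEMP_sq": csemp_sq, "CSEE_sq": csee_sq}
```

With this scaling, the squared CSEP equals the ordinary sample variance (divisor n - 1) of the X2 total scores within the conditioning group, which gives a convenient check on the computation.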

In practice, scores on two full-length tests are rarely available. However, if a 
single full-length test can be divided into two parallel half-length tests, then 
estimates for the full-length test conditional variances can be obtained from the 
half-length test conditional variances by using the step-up formulas derived by 
Woodruff (1990, 1991). Suppose that the full-length test, X, can be divided into 
two classically parallel half-length tests, $Y1 = T_Y + E_{Y1}$ with $m_{Y1}$
items and $Y2 = T_Y + E_{Y2}$ with $m_{Y2}$ items. Let the linear transformation

X* = X*(Y1) = aY1 + b   (5.)

rescale the half-length test, Y1 , to have the same mean and variance as the 
full-length test, X. The stepped-up estimates are: 



$\{CSEP^*[X(Y2) \mid X^*(Y1)]\}^2 = \dfrac{2(1 + 3 r_{Y1Y2})}{(1 + r_{Y1Y2})^2}\, m_{Y2}\, MS_P[Y2 \mid X^*(Y1)]$,   (6.)

$\{CSEMP^*[X(Y2) \mid X^*(Y1)]\}^2 = 2\, m_{Y2}\, MS_{PM}[Y2 \mid X^*(Y1)]$, and   (7.)

$\{CSEE^*[X(Y2) \mid X^*(Y1)]\}^2 = m_{Y2} \left\{ \dfrac{2(1 + 3 r_{Y1Y2})}{(1 + r_{Y1Y2})^2}\, MS_P[Y2 \mid X^*(Y1)] - 2\, MS_{PM}[Y2 \mid X^*(Y1)] \right\}$.   (8.)

In the preceding three equations, the conditioning is on X*(Y1), the two mean
squares are computed from a two-way ANOVA on the item scores for Y2, and
the notation X(Y2) denotes that these half-length test mean squares have been
stepped-up to full-length test mean squares. Finally, $r_{Y1Y2}$ denotes the
sample correlation between Y1 and Y2.
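
The step-up computation itself is brief. The sketch below assumes the multiplier for the persons mean square is $2(1 + 3r)/(1 + r)^2$ with $r = r_{Y1Y2}$, as the step-up equations are read here, and that the two half-length mean squares have already been obtained; the function names are hypothetical.

```python
def step_up_factor(r):
    """Multiplier applied to m_Y2 * MS_P in equation (6.), read as 2(1 + 3r) / (1 + r)^2."""
    return 2.0 * (1.0 + 3.0 * r) / (1.0 + r) ** 2

def stepped_up_squared(ms_p, ms_pm, m_y2, r):
    """Squared CSEP*, CSEMP*, and CSEE* from half-length mean squares for Y2."""
    csep_sq = step_up_factor(r) * m_y2 * ms_p  # equation (6.)
    csemp_sq = 2.0 * m_y2 * ms_pm              # equation (7.)
    csee_sq = csep_sq - csemp_sq               # equation (8.)
    return csep_sq, csemp_sq, csee_sq
```

As a plausibility check on the multiplier, not a result reported in the paper: for strictly parallel halves with correlation r and variance $s^2_Y$, the average half-length prediction variance $s^2_Y(1 - r^2)$ multiplied by this factor equals $2 s^2_Y (1 + r)(1 - \rho^2)$ with $\rho = 2r/(1 + r)$, which is the average full-length prediction variance implied by the Spearman-Brown relation.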

In what follows, reference will be made to stepped-up half-length test 
conditional standard deviations and to full-length test conditional standard 
deviations. The stepped-up half-length test conditional standard deviations, as 
given on the left side in equations (6.), (7.), and (8.), will always have asterisks 
as part of their name whereas the full-length test conditional standard 
deviations, as given on the left side in equations (2.), (3.), and (4.), will not. 
There are at least two methods that estimate $\sigma^2(E_X \mid X)$ in place of
$\sigma^2(E_X \mid T_X)$. The first of these methods is the difference method
due to Thorndike (1951). This method divides a single full-length test, X, into
two parallel half-length tests, Y1 and Y2, and then calculates

$\text{T-CSEM}(E_X \mid X) = s(Y1 - Y2 \mid X)$   (9.)

as a substitute estimate for $\sigma(E_X \mid T_X)$. Woodruff (1990) critically
discusses the basis for this method. Another such method is presented by Feldt,
Steffen, & Gupta (1985). This method is based on an ANOVA of the item responses
of X. It substitutes as an estimate for $\sigma(E_X \mid T_X)$ the following
estimate

$\text{F-CSEM}(E_X \mid X) = [m_X (MS_{PM} \mid X)]^{1/2}$   (10.)

where $(MS_{PM} \mid X)$ is a conditional interaction mean square from a
measures by persons ANOVA of the item responses of X given a fixed value of X.
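
Both comparison estimates can be sketched for a single conditioning group, that is, the examinees who share one value (or interval) of X. The code below is an illustration under stated assumptions: the sample standard deviation uses the divisor n - 1, the interaction mean square uses the usual (n - 1)(m - 1) degrees of freedom, and the function names are hypothetical. Neither choice is confirmed by the sources cited above.

```python
import numpy as np

def t_csem(y1, y2):
    """Difference method, equation (9.): SD of Y1 - Y2 within the group."""
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    return d.std(ddof=1)

def f_csem(items):
    """ANOVA method, equation (10.): sqrt(m_X * MS_PM) within the group.

    items: (persons, measures) array of item responses on X for the group.
    """
    items = np.asarray(items, dtype=float)
    n, m = items.shape
    resid = (items
             - items.mean(axis=1, keepdims=True)
             - items.mean(axis=0, keepdims=True)
             + items.mean())
    ms_pm = np.sum(resid ** 2) / ((n - 1) * (m - 1))
    return np.sqrt(m * ms_pm)
```

When the group is defined by a single value of X, every examinee has the same total score, so the persons main effect vanishes and only the interaction term carries information.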

The Empirical Investigation 

The data for this study were a random sample of 40,000 examinees with
scores on the October 1986 ACT Assessment Program (American College Testing
[ACT], 1987). The ACT Assessment Program (AAP) then consisted of 219
dichotomously scored items from four subtest areas: 75 from English, 40 from
Mathematics, 52 from Social Studies, and 52 from Natural Sciences. Though
data from 219 items were available, the goal was to divide the items into four
parallel groups of items, so the first three English items were eliminated. The
remaining 216 AAP items were treated as an item pool from which parallel tests
and half-tests could be constructed. In particular, four 54-item half-length
tests were created and these were combined to yield two 108-item full-length
tests. The four half-length tests were denoted Y1, Y2, Y3, and Y4. The two
half-length tests Y1 and Y2 were combined to yield the full-length test X1, and
the two half-length tests Y3 and Y4 were combined to yield the full-length test
X2. All four half-length tests were carefully constructed to be balanced in
content and to have similar test score statistics. The two full-length tests
also were constructed to be balanced in content and to have similar test score
statistics.

The first step in constructing the four parallel half-length tests and the two
parallel full-length tests was to compute the correlations between item position
and item difficulty within each one of the four AAP subtests. Because the items
within these four AAP subtests were ordered by item difficulty, negative
correlations of -.36, -.88, -.56, and -.62 were found for the English,
Mathematics, Natural Sciences, and Social Studies AAP subtests, respectively.
As a consequence, a systematic selection of the subtest items in their original
test order was used. Table 1 shows the systematic item selection scheme for
half-length tests Y1, Y2, Y3, and Y4. For example, to construct test Y1, the 1st
out of every 4 English items, the 4th out of every 4 Mathematics items, the 3rd
out of every 4 Social Studies items, and the 2nd out of every 4 Natural Sciences
items were used. As a result, each one of the four parallel half-length tests had
18 English items, 10 Mathematics items, 13 Social Studies items, and 13
Natural Sciences items; and each one of the two parallel full-length tests had
36 English items, 20 Mathematics items, 26 Social Studies items, and 26
Natural Sciences items. The full-length tests and half-length tests were not
homogeneous in content, but they were parallel in content. This illustrates an
advantage of the current method, namely, that an assumption of unidimensionality
is not required.
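
The systematic selection can be written out compactly. The sketch below encodes the positions given in Table 1 and assumes that items are numbered from 1 within each subtest, with the English numbering starting after the three eliminated items; the names SCHEME, N_ITEMS, and select_items are hypothetical.

```python
# Position (1-4) taken from every successive block of four items, per Table 1.
SCHEME = {
    "Y1": {"English": 1, "Mathematics": 4, "Social Studies": 3, "Natural Sciences": 2},
    "Y2": {"English": 2, "Mathematics": 3, "Social Studies": 4, "Natural Sciences": 1},
    "Y3": {"English": 3, "Mathematics": 2, "Social Studies": 1, "Natural Sciences": 4},
    "Y4": {"English": 4, "Mathematics": 1, "Social Studies": 2, "Natural Sciences": 3},
}

# Items per subtest after dropping the first three English items (75 - 3 = 72).
N_ITEMS = {"English": 72, "Mathematics": 40, "Social Studies": 52, "Natural Sciences": 52}

def select_items(test):
    """Return {subtest: [1-based item positions]} for one half-length test."""
    return {
        subtest: list(range(start, N_ITEMS[subtest] + 1, 4))
        for subtest, start in SCHEME[test].items()
    }

for name in SCHEME:
    counts = {sub: len(pos) for sub, pos in select_items(name).items()}
    print(name, counts, sum(counts.values()))
```

Each half-length test receives 18, 10, 13, and 13 items from the four subtests, for 54 items in total, and every retained item is assigned to exactly one half-length test.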

Table 1. Systematic Item Sampling Scheme for Constructing Parallel Half-Length Tests.

                                  AAP Subtests
Half-Length Test   English   Mathematics   Social Studies   Natural Sciences
Y1                    1           4              3                 2
Y2                    2           3              4                 1
Y3                    3           2              1                 4
Y4                    4           1              2                 3

Table 2. Test Score Statistics for the Full-Length and Half-Length Tests.

                      Correlations, KR20's, and Disattenuated Correlations*
Test   Mean    SD      X1      X2      Y1      Y2      Y3      Y4
X1     62.2   16.8    0.93    0.93     --      --      --      --
X2     59.9   16.7    1.00    0.93     --      --      --      --
Y1     31.4    8.7     --      --     0.87    0.87    0.87    0.87
Y2     30.8    8.7     --      --     1.00    0.87    0.87    0.87
Y3     29.5    8.7     --      --     1.00    1.00    0.87    0.87
Y4     30.4    8.6     --      --     1.00    1.00    1.00    0.87

*Correlations are above the diagonal, KR20's are on the diagonal, and
disattenuated correlations are below the diagonal.

Table 2 presents some relevant test score statistics for the two 108-item
parallel full-length tests, X1 and X2, and the four 54-item parallel half-length
tests: Y1, Y2, Y3, and Y4. The statistics in Table 2 indicate that the two
full-length tests have nearly identical test score statistics except for a
modest difference between the means, and that should have little effect on the
procedures under study. The same is true for the four half-length tests.
Relevant correlations, KR20's, and relevant disattenuated correlations (using
the KR20's) are also presented in Table 2. These support the claim that the
half-length tests and the full-length tests are indeed parallel.
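
The disattenuated correlations in Table 2 follow from the standard correction for attenuation, which divides an observed correlation by the square root of the product of the two reliabilities (here the KR20's). The sketch below reproduces the tabled values to two decimals; the Spearman-Brown line is an added consistency check that is not reported in the paper, and the function names are hypothetical.

```python
from math import sqrt

def disattenuated(r_xy, rel_x, rel_y):
    """Correction for attenuation: r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / sqrt(rel_x * rel_y)

def spearman_brown_double(r_half):
    """Reliability of a doubled test from the correlation of parallel halves."""
    return 2.0 * r_half / (1.0 + r_half)

print(round(disattenuated(0.87, 0.87, 0.87), 2))  # half-length tests: 1.0
print(round(disattenuated(0.93, 0.93, 0.93), 2))  # full-length tests: 1.0
print(round(spearman_brown_double(0.87), 2))      # 0.93
```

The last line shows that a half-length correlation of 0.87 implies a full-length reliability of about 0.93, in agreement with the full-length KR20's in Table 2.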

The full-length test score scale of 108 items was divided into intervals that
comprised three score points starting with a score of 1. These intervals had
midpoints of 2, 5, 8, ..., 104, and 107. CSEP, CSEMP, and CSEE estimates
using full-length tests X1 and X2 were computed for each of these intervals
except for some intervals at the bottom and top of the score scale that did not
have a sufficient number of examinees for stable estimation. However, the
expected guessing score on a 108-item test is 27 and the ACT Assessment
Program (ACT, 1987) is designed so that few examinees obtain nearly perfect
scores. Hence, the score interval midpoints of 26 through 96, for which stable
CSEP, CSEMP, and CSEE estimates were obtained, cover the length of the
score scale that the AAP was designed to most effectively measure. Two sets of
such estimates were obtained: one conditioning on X1 and the other
conditioning on X2.
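
The grouping of scores into three-point intervals can be expressed as a one-line mapping. This is a minimal sketch assuming raw scores of 1 through 108 are binned as 1-3, 4-6, and so on; the function name is hypothetical, and a score of 0 is treated as falling outside the intervals.

```python
def interval_midpoint(score):
    """Midpoint of the three-point interval containing a raw score of 1..108."""
    if not 1 <= score <= 108:
        raise ValueError("score must lie between 1 and 108")
    return 3 * ((score - 1) // 3) + 2

# All distinct interval midpoints on the 108-item score scale.
midpoints = sorted({interval_midpoint(s) for s in range(1, 109)})
print(midpoints[:3], midpoints[-2:], len(midpoints))
```

Examinees are then grouped by the midpoint of their conditioning score, and the conditional ANOVA is run separately within each group.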

Next, two sets of stepped-up half-length test estimates of the CSEP, CSEMP,
and CSEE were computed using the two pairs of half-length tests (pair 1: Y1
and Y2, pair 2: Y3 and Y4) and the same three-point wide test score intervals.
These stepped-up half-length test estimates of the CSEP, CSEMP, and CSEE
were then compared to the full-length test estimates of the CSEP, CSEMP, and
CSEE computed directly from X1 and X2. In particular, the
CSEP*[X(Y2) | X*(Y1)] was compared to the CSEP(X2 | X1) and the
CSEP*[X(Y4) | X*(Y3)] was compared to the CSEP(X1 | X2). Similar comparisons
were made for the CSEMP and the CSEE.

Figure 1a in Appendix B is a graph of the stepped-up half-length test
estimates: CSEP*[X(Y2) | X*(Y1)], CSEMP*[X(Y2) | X*(Y1)], and
CSEE*[X(Y2) | X*(Y1)] along with the full-length test estimates: CSEP(X2 | X1),
CSEMP(X2 | X1), and CSEE(X2 | X1). Figure 1b in Appendix B is the same as
Figure 1a except that quadratic polynomials were used to smooth the CSEP*,
CSEE*, CSEP, and CSEE estimates. Figures 2a and 2b in Appendix B are
analogous to Figures 1a and 1b except that they compare the
CSEP*[X(Y4) | X*(Y3)], CSEMP*[X(Y4) | X*(Y3)], and CSEE*[X(Y4) | X*(Y3)]
estimates with the CSEP(X1 | X2), CSEMP(X1 | X2), and CSEE(X1 | X2) estimates.

The CSEMP estimates also were compared to the CSEM estimates computed
by the difference method (Thorndike, 1951) and the ANOVA method (Feldt et
al., 1985) using the same intervals of three score points that were used to
compute the CSEMP estimates. Figure 3a is a graph of the
CSEMP*[X(Y2) | X*(Y1)], the F-CSEM(E | X1), and the T-CSEM(E | X1) estimates.
Figure 3b is the same as Figure 3a except that the T-CSEM(E | X1) estimates
have been smoothed using a quadratic polynomial. Figures 4a and 4b are
analogous to Figures 3a and 3b except that Figures 4a and 4b compare the
CSEMP*[X(Y4) | X*(Y3)] estimates to the F-CSEM(E | X2) and T-CSEM(E | X2)
estimates.
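
The quadratic smoothing used for some of the plotted curves can be sketched as an ordinary least-squares fit of a second-degree polynomial to the estimates as a function of the interval midpoint. The paper does not state whether the fit was weighted (for example, by the number of examinees per interval), so the unweighted fit below is an assumption, and the function name is hypothetical.

```python
import numpy as np

def smooth_quadratic(midpoints, estimates):
    """Fitted values from an unweighted least-squares quadratic in the midpoint."""
    coeffs = np.polyfit(midpoints, estimates, deg=2)
    return np.polyval(coeffs, midpoints)
```

A typical call passes the vector of interval midpoints and the matching vector of conditional standard deviation estimates, and plots the returned values in place of the raw estimates.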

Discussion 

The primary purpose of the present study was to evaluate the accuracy of
the step-up procedure. How well the stepped-up half-length test estimates
(CSEP*, CSEMP*, and CSEE*) approximate the full-length test estimates (CSEP,
CSEMP, and CSEE) can be seen in Figures 1 and 2. These figures indicate that
the step-up procedure works very well for the CSEMP, fairly well for the CSEP,
and less well for the CSEE.

The secondary purpose of this paper was to compare the stepped-up
half-length test estimate, CSEMP*, with the Feldt et al. (1985) and the
Thorndike (1951) estimates of the CSEM, namely, F-CSEM and T-CSEM, respectively.
Figures 3 and 4 show that the T-CSEM tends to be less than both the F-CSEM 
and the CSEMP*. Figures 3 and 4 also show that the F-CSEM tends to be less 
than the CSEMP* at both ends of the score scale but slightly greater than the 
CSEMP* in the middle of the score scale. These latter results agree with those 
found by Woodruff (1990). Because the average CSEMP equals the average 
CSEM, these results suggest that the T-CSEM is generally underestimating the 
CSEM, and that the F-CSEM may be slightly underestimating the CSEM at the 
ends of the score scale, but on average the F-CSEM appears closer to the 
CSEM than the T-CSEM. 

Finally, all of the half-length and full-length test scores in the present study 
had unimodal approximately symmetrical distributions so the results reported 
here do not necessarily generalize to other types of test score distributions. 
However, Woodruff (1990) does report some limited results for skewed test 
score distributions, and those results are similar to the ones reported here. 
Appendix A

To show that $\sigma(T_X, E_{X2} \mid X1) = 0$, first recall that
$X1 = T_X + E_{X1}$ and $X2 = T_X + E_{X2}$ are parallel measurements and that
$\mu$ denotes expectation. The conditional covariance between $T_X$ and
$E_{X2}$ given X1 can be written as

$\sigma(T_X, E_{X2} \mid X1) = \mu(T_X E_{X2} \mid X1) - \mu(T_X \mid X1)\,\mu(E_{X2} \mid X1)$.   (A1)

Using the double expectation theorem (DeGroot, 1989, p 220) on the first term
on the right hand side of (A1) gives

$\sigma(T_X, E_{X2} \mid X1) = \mu[\mu(T_X E_{X2} \mid X1, T_X) \mid X1] - \mu(T_X \mid X1)\,\mu(E_{X2} \mid X1)$
$= \mu[T_X\, \mu(E_{X2} \mid T_X, X1) \mid X1] - \mu(T_X \mid X1)\,\mu(E_{X2} \mid X1)$.   (A2)

Making the assumption of linear experimental independence (Lord & Novick,
1968, p 45) between $E_{X2}$ and X1 and between $E_{X2}$ and $(X1, T_X)$
implies that

$\mu(E_{X2} \mid X1) = 0$ for all values of X1 and   (A3)

$\mu(E_{X2} \mid X1, T_X) = 0$ for all values of $(X1, T_X)$.   (A4)

Substituting (A3) and (A4) into (A2) yields the desired result:
$\sigma(T_X, E_{X2} \mid X1) = \mu(T_X \cdot 0 \mid X1) - \mu(T_X \mid X1) \cdot 0 = 0$.

To show that $\mu[\sigma^2(E_{X2} \mid X1)] = \mu[\sigma^2(E_{X2} \mid T_X)]$,
note that by Theorem 2.6.2 of Lord & Novick (1968, p 35)

$\sigma^2(E_{X2}) = \mu[\sigma^2(E_{X2} \mid X1)] + \sigma^2[\mu(E_{X2} \mid X1)] = \mu[\sigma^2(E_{X2} \mid T_X)] + \sigma^2[\mu(E_{X2} \mid T_X)]$.

It follows from the assumption that $E_{X2}$ is linearly experimentally
independent of both X1 and $T_X$ that
$\mu(E_{X2} \mid X1) = \mu(E_{X2} \mid T_X) = 0$. Hence, the above becomes

$\sigma^2(E_{X2}) = \mu[\sigma^2(E_{X2} \mid X1)] = \mu[\sigma^2(E_{X2} \mid T_X)]$.

Appendix B

Figures

Figure 1a.

Plot of the full-length CSEP(X2 | X1), CSEMP(X2 | X1), and CSEE(X2 | X1) against
the stepped-up half-length CSEP*[X(Y2) | X*(Y1)], CSEMP*[X(Y2) | X*(Y1)], and
CSEE*[X(Y2) | X*(Y1)]. [Plot not reproduced; horizontal axis: MIDPT, the score
interval midpoint.]

Figure 1b.

Plot of the full-length CSEP(X2 | X1), CSEMP(X2 | X1), and CSEE(X2 | X1) against
the stepped-up half-length CSEP*[X(Y2) | X*(Y1)], CSEMP*[X(Y2) | X*(Y1)], and
CSEE*[X(Y2) | X*(Y1)] with quadratic polynomial smoothing of the CSEP*, CSEE*,
CSEP, and CSEE.

Figure 2a.

Plot of the full-length CSEP(X1 | X2), CSEMP(X1 | X2), and CSEE(X1 | X2) against
the stepped-up half-length CSEP*[X(Y4) | X*(Y3)], CSEMP*[X(Y4) | X*(Y3)], and
CSEE*[X(Y4) | X*(Y3)]. [Plot not reproduced; horizontal axis: MIDPT, the score
interval midpoint.]

Figure 2b.

Plot of the full-length CSEP(X1 | X2), CSEMP(X1 | X2), and CSEE(X1 | X2) against
the stepped-up half-length CSEP*[X(Y4) | X*(Y3)], CSEMP*[X(Y4) | X*(Y3)], and
CSEE*[X(Y4) | X*(Y3)] with quadratic polynomial smoothing of the CSEP*, CSEE*,
CSEP, and CSEE.

Figure 3a.

Plot of the stepped-up half-length CSEMP*[X(Y2) | X*(Y1)] against the Feldt et
al. (1985) ANOVA method estimate, F-CSEM(E | X1), and the Thorndike (1951)
difference method estimate, T-CSEM(E | X1). [Plot not reproduced; horizontal
axis: MIDPT, the score interval midpoint.]

Figure 3b.

Plot of the stepped-up half-length CSEMP*[X(Y2) | X*(Y1)] against the Feldt et
al. (1985) ANOVA method F-CSEM(E | X1) and the Thorndike (1951) difference
method T-CSEM(E | X1) with quadratic polynomial smoothing of the latter.

Figure 4a.

Plot of the stepped-up half-length CSEMP*[X(Y4) | X*(Y3)] against the Feldt et
al. (1985) ANOVA method estimate, F-CSEM(E | X2), and the Thorndike (1951)
difference method estimate, T-CSEM(E | X2).

Figure 4b.

Plot of the stepped-up half-length CSEMP*[X(Y4) | X*(Y3)] against the Feldt et
al. (1985) ANOVA method F-CSEM(E | X2) and the Thorndike (1951) difference
method T-CSEM(E | X2) with quadratic polynomial smoothing of the latter.